Abstract 

Photosynthesis by diatoms accounts for roughly one-fifth of global primary production, but despite this, relatively little is known about 
their plastid genomes. We report the completely sequenced plastid genomes for eight phylogenetically diverse diatoms and show 
them to be variable in size, gene and foreign sequence content, and gene order. The genomes contain a core set of 122 protein-
coding genes, with 15 additional genes exhibiting complex patterns of 1) gene losses at varying phylogenetic scales, 2) functional 
transfers to the nucleus, 3) gene duplication, divergence, and differential retention of paralogs, and 4) acquisitions of putatively 
functional recombinase genes from resident plasmids. The newly sequenced genomes also contain several previously unreported 
genes, highlighting how poorly characterized diatom plastid genomes are overall. Genome size variation reflects major expansions of 
the inverted repeat region in some cases but, more commonly, large-scale expansions of intergenic regions, many of which contain 
unique open reading frames of likely foreign origin. Although many gene clusters are conserved across species, rearrangements 
appear to be frequent in most lineages. 
Introduction 

Diatoms are photosynthetic algae within the large and diverse 
heterokont lineage, which includes brown algae, golden 
algae, and more distantly related nonphotosynthetic taxa, 
including the pathogenic water mold, Phytophthora (a small 
number of diatoms are secondarily nonphotosynthetic too [Li 
and Volcani 1987]). As in cryptophytes, haptophytes, and most 
dinoflagellates, the plastids of diatoms (like those of all plastid-bearing 
heterokonts) trace their origin to a secondary endosymbiosis 
with a red alga (Archibald 2009). Primary and secondary 
"red" lineages are now principal components of marine 
ecosystems and important contributors to the global cycling 
of carbon and oxygen (Falkowski et al. 2004). Diatoms, in 
particular, are prolific photosynthesizers, responsible for 
roughly 20% of global net primary production (Nelson et al. 
1995). By fixing and exporting massive amounts of carbon 
from the atmosphere to the deep ocean, diatoms are primary 
drivers of the "biological pump" (Hopkinson et al. 2011). 



Their photosynthetic output reflects the vast breadth of their 
ecological and phylogenetic diversity, sheer numerical abun- 
dance, and Form ID Rubisco enzyme, which has an unusually 
high affinity and selectivity for carbon dioxide (Roberts et al. 
2007). Moreover, their photosynthetic products include a 
suite of energy-rich lipids and complex polysaccharides that 
are a primary entry point of carbon into marine food webs 
(Kroth et al. 2008). 

Plastid genome data from primary and secondary red line- 
ages have revealed substantial differences in genome size, 
gene content, and gene order. Compared with their counter- 
parts in the green lineage (green algae and land plants), both 
primary and secondary red plastid genomes tend to have 
more genes, minimal intergenic space, little repetitive se- 
quence, and few if any introns (Green 2011). To date, the 
plastid genomes of seven diatoms and two dinoflagellates 
with diatom-derived plastids have been sequenced. These 
genomes have a moderate gene content (158-162 genes), 
Table 1 

Culturing, Sequencing, and Assembly Information for the Eight Newly Sequenced Diatom Plastid Genomes 

Taxon                       GenBank Accession  Culture Collection  Strain ID  Growth Medium  Sequencing Platform        Sequence Assembler
Leptocylindrus danicus      KC509524           NCMA                CCMP1856   F/2            Roche 454, Illumina HiSeq  Newbler, ABySS
Coscinodiscus radiatus      KC509521           NCMA                CCMP310    F/2            Roche 454                  Newbler
Lithodesmium undulatum      KC509525           NCMA                CCMP1797   F/2            Roche 454                  Newbler
Asterionellopsis glacialis  KC509520           NCMA                CCMP1717   F/2            Roche 454                  Newbler
Asterionella formosa        KC509519           CPCC                UTCC605    COMBO          Roche 454                  Newbler
Eunotia naegelii            KF733443           UTEX                FD354      COMBO          Illumina MiSeq             ABySS, Ray
Cylindrotheca closterium    KC509522           NCMA                CCMP1855   F/2            Roche 454                  Newbler
Didymosphenia geminata      KC509523           NA^a                BCC011     F/2            Illumina HiSeq             ABySS, Ray

Note. NCMA, Provasoli-Guillard National Center for Marine Algae and Microbiota; CPCC, Canadian Phycological Culture Centre at the University of Toronto; UTEX, The 
Culture Collection of Algae at The University of Texas at Austin. 

^a Environmental sample, Boulder Creek, Colorado, USA, April 2011. 



intermediate between haptophytes and cryptophytes (Green 
2011). Introns are rare, with just one report of an intron in the 
atpB gene of Seminavis robusta (Brembu et al. 2013). Finally, 
unlike their primary red algal progenitors, diatom plastid ge- 
nomes appear to be highly rearranged (Oudot-Le Secq et al. 
2007), even between close relatives (Lommer et al. 2010). 

Diatoms are an extraordinarily diverse lineage (Mann and 
Vanormelingen 2013), so the small sample of sequenced plastid 
genomes has precluded meaningful insights into broad-scale 
patterns of evolution. We sequenced plastid genomes 
for eight diverse diatoms, doubling the number of sequenced 
genomes and filling in several important phylogenetic gaps, 
including taxa that bracket some of the earliest splits in the 
phylogeny. This expanded taxonomic sampling showed that 
diatom plastid genomes are particularly labile in size, structure, 
and sequence content. 

Materials and Methods 

Diatom Cultures, DNA Extraction, and Sequencing 

Culture information, growth conditions, and sequencing strat- 
egies for the eight newly sequenced genomes are summarized 
in table 1. Didymosphenia could not be cultured, so six individual 
cells were isolated from a sample collected in Boulder 
Creek, Colorado, USA, and whole-genome amplification was 
performed on each cell using the Qiagen REPLI-g Mini Kit. The 
six amplification products were then pooled for sequencing. 

For Eunotia, we disrupted frozen cell pellets by agitating 
them with glass beads in a Mini-Beadbeater-24 (BioSpec 
Products) before extracting total genomic DNA with the 
Qiagen DNeasy Plant Mini Kit. For the remaining species, we 
isolated plastid DNA by resuspending frozen cells in 10-15 ml 
of resuspension buffer (50 mM Tris [pH 8.0], 25 mM ethylenediaminetetraacetic 
acid, and 50 mM NaCl) and disrupting 
them by nitrogen decompression with a Parr Cell Disruption 
Bomb at 750-800 psi for 20-30 min. Plastids were lysed by 
shaking them at 100 rpm for 60 min at 50 °C in a solution 
containing 250 µl of 20% Triton X-100 and 1 ml Pronase 
(10 mg/ml) per 10 ml of cell slurry. We then added an equal 
weight of cesium chloride (CsCl), mixed the slurry until 
the CsCl was fully dissolved, and dispensed it into 6 ml PA 
Ultracrimp tubes (Sorvall) with 50 µl of ethidium bromide 
(EtBr) (10 mg/ml). After centrifugation at 65,000 rpm in a 
Sorvall TV-1665 rotor for 12 h, we extracted the DNA bands 
and removed EtBr with repeated washes in salt-saturated isopropanol. 
The spin was repeated with 40 µl Hoechst 33258 
dye (10 mg/ml H2O). Following the spin, the DNA bands were 
extracted and Hoechst dye removed by repeated 1:1 washes 
with salt-saturated isopropanol. We removed the CsCl by dialysis 
in TE buffer with buffer changes every 12 h for 48 h. 

We used three different DNA sequencing platforms, indi- 
vidually or in combination, to generate the data (table 1). 
Roche 454 GS-FLX sequencing (Titanium reagents) generated 
500-bp single-end reads and was carried out at the W.M. 
Keck Center for Comparative and Functional Genomics at 
the University of Illinois. The Illumina HiSeq 2000 platform 
generated 100-bp paired-end reads, used libraries of length 
300 bp, and was carried out at the Genome Sequencing and 
Analysis Facility at the University of Texas at Austin. Finally, the 
Eunotia genome was sequenced using the Illumina MiSeq 
platform at the Institute for Genomics and Systems Biology 
at Argonne National Laboratory, using a 300-bp library and 
150-bp paired-end reads. 

Genome Assembly and Analysis 

We used Newbler, ABySS ver. 1.3, or Ray ver. 2.2.0 (Simpson 
et al. 2009; Boisvert et al. 2010) to assemble the reads 
(table 1), and Geneious ver. 5.4 (Biomatters Ltd., Auckland, 
New Zealand) or Sequencher ver. 4.5 (Gene Codes 
Corporation, Ann Arbor, MI, USA) to guide finishing of the 
assemblies. Protein genes were annotated with DOGMA 
(Wyman et al. 2004), and predicted tRNAs and tmRNAs 
were identified with ARAGORN (Laslett and Canback 2004). 
Boundaries of the rRNA and ffs genes were delimited by direct 
comparison to sequenced diatom genomes with NCBI- 
BLASTN. We identified pseudogenes based on their BLASTN 



Genome Biol. Evol. 6(3):644-654. doi:10.1093/gbe/evu039 Advance Access publication February 23, 2014 






Ruck et al. 



GBE 



[Figure 1: horizontal bar chart; x-axis, Sequence Coverage (kb), 0-160] 

Fig. 1. Sequence coverage by protein-coding, intergenic, and expanded (>1 kb in length) intergenic regions in diatom plastid genomes. Bars are 
drawn proportional to the genome size and show the fraction of the genome occupied by these three sequence categories. Taxa in boldface identify 
genomes sequenced for this study. Phylogenetic relationships were redrawn from Theriot et al. (2010) and unpublished data. Taxa marked with a superscript 
"a" are dinoflagellates with diatom-derived plastids (Imanian et al. 2010). 



similarity to functional homologs (e-value < 1e-6) and, in most 
cases, by their conserved positions in the genome. 

We used NCBI-BLASTP to search the nuclear genomes of 
Phaeodactylum tricornutum, Thalassiosira pseudonana, and 
Thalassiosira oceanica for genes missing from one or more plastid 
genomes. NCBI's ORF Finder (http://www.ncbi.nlm.nih.gov/gorf/, 
last accessed March 24, 2014) was used to search intergenic 
regions for open reading frames (ORFs) >100 amino 
acids in length. Intergenic sequences were considered 
unique if they had no match to a local database consisting of 
37 primary and secondary red plastid genomes and three 
plastid-localized diatom plasmids, based on a BLAST search 
with an e-value cutoff of 1e-6 and the following search parameters: 
word size = 9; reward = 2; mismatch penalty = -3; 
gap opening = 5; and gap extension = 2. Whole-genome 
alignments were performed using progressive MAUVE ver. 
2.3.1 with default parameters (Darling et al. 2010). 
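
The scoring parameters listed above map directly onto BLAST+ command-line options. The following is a minimal sketch that only assembles such a command; the file name `intergenic.fasta`, the database name `red_plastids_db`, the choice of the `blastn` program, and the tabular output format are illustrative assumptions, because the text reports the parameters but not the exact invocation.

```python
import shlex

def build_blastn_command(query_fasta, database, evalue=1e-6):
    """Assemble a BLAST+ blastn argument list using the search parameters
    reported in the text: word size 9, reward 2, mismatch penalty -3,
    gap opening 5, gap extension 2, and an e-value cutoff of 1e-6."""
    return [
        "blastn",
        "-query", query_fasta,      # hypothetical input file name
        "-db", database,            # hypothetical local database name
        "-evalue", str(evalue),
        "-word_size", "9",
        "-reward", "2",
        "-penalty", "-3",
        "-gapopen", "5",
        "-gapextend", "2",
        "-outfmt", "6",             # tabular output (an assumption)
    ]

cmd = build_blastn_command("intergenic.fasta", "red_plastids_db")
print(shlex.join(cmd))
```

Under this sketch, an intergenic sequence with no reported hit would be scored as unique; the command is printed rather than executed so that it can be inspected first.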

Phylogenetic Analyses 

Sequence alignments for acpP and tsf included the diatoms 
sequenced for this study, all fully sequenced primary and sec- 
ondary red plastid genomes that contained the gene, and any 
nuclear homologs found in the sequenced diatom nuclear 
genomes. All genes were manually aligned using MacClade 
ver. 4.08 (Maddison and Maddison 2005). 

To account for taxon-specific amino acid compositional 
heterogeneity (e.g., in nuclear vs. plastid genes), tree inference 



was performed with NH-PhyloBayes ver. 0.2.3 using the time- 
and site-heterogeneous model (CAT-BP) (Blanquart and Lartillot 
2006, 2008). We ran two MCMC chains for 2 × 10^3 (tsf) or 
3.2 × 10^3 (acpP) generations, sampling every tenth cycle. The 
number of categories was set to 30 (tsf) and 20 (acpP), corresponding 
to the mean of the posterior distribution of this 
parameter estimated from a PhyloBayes analysis under the 
GTR+G+CAT substitution model (Lartillot and Philippe 2004; 
Lartillot et al. 2013). Convergence and stationarity of runs were 
assessed through the built-in diagnostics in NH-PhyloBayes 
after discarding the first 10^3 samples as the burn-in. 

Results and Discussion 

General Features 

Each plastid genome mapped as a single, circular chromo- 
some with two large inverted repeats (IR) separating small 
(SSC) and large single-copy (LSC) regions. The eight genomes 
ranged in size from 118 kb in Didymosphenia to 166 kb in 
Cylindrotheca (fig. 1). Diatom plastid genomes share a core 
set of 122 protein-coding genes, 3 rRNAs, 27 tRNAs, and 2 
additional RNA genes, tmRNA and ffs (supplementary 
table S1, Supplementary Material online). 

Nucleotide composition is highly conserved, with G+C (GC) 
content ranging from 29% to 32% across the eight genomes. 
GC content of protein-coding genes ranged from 30% 
to 33%, mirroring that of the overall genome, whereas 
intergenic values were substantially lower, just 16-20% 
in most species. The Asterionellopsis, Eunotia, and 
Cylindrotheca genomes contained large amounts of compar- 
atively GC-"rich" expanded intergenic DNA (but still low: 
28% GC), driving up their overall intergenic GC content to 
as high as 27% in some species (supplementary table S2, 
Supplementary Material online). 
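
The GC comparisons above reduce to a simple per-sequence calculation. A minimal sketch is given below; the two sequences are hypothetical toy strings rather than regions from the genomes reported here, and ignoring ambiguous bases is an assumption about how such positions would be handled.

```python
def gc_percent(seq):
    """Return the G+C percentage of a nucleotide sequence, counting only
    unambiguous bases (A, C, G, T) in the denominator."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0

# Hypothetical toy sequences for illustration only.
coding_like = "ATGGCTAAAGGTCCATTATAA"
intergenic_like = "TTTTAAATATTTAGAAATTTA"
print(round(gc_percent(coding_like), 1), round(gc_percent(intergenic_like), 1))
```

Applied separately to concatenated protein-coding and intergenic sequence from an annotated genome, the same function would give the two classes of values contrasted in the text.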

Genome Expansions 

Expansions of the Inverted Repeat Region 

The eight newly sequenced plastid genomes include the larg- 
est so-far sequenced from diatoms, substantially expanding 
their known size range. Expansion and contraction of the IR 
accounts for most of the size variation in angiosperm plastid 
genomes (Plunkett and Downie 2000), and similarly, the IR 
in diatoms varies in length by nearly 4-fold — from 7 kb in 
Didymosphenia to 27 kb in Eunotia (fig. 2). This variation re- 
flects several independent IR expansions and, very likely, con- 
tractions. Expansions have been bi-directional, incorporating 
parts of one or both of the LSC and SSC regions (fig. 2). In 
some cases, IR expansions have resulted in a large number of 
gene duplications. The IR expansions in T. pseudonana and 
T. oceanica resulted in the duplication of more than a dozen 
plastid genes (fig. 2). The largest IR expansion occurred in 
Eunotia, resulting in the duplication of >20 genes and 
an 18 kb increase in IR length compared with Asterionella 
(fig. 2). As a result, the IR (27 kb) is now larger than the SSC 
region (25 kb) in Eunotia. 
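
The IR figures quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only the rounded values given in the text; the Asterionella IR length it prints is implied by those rounded values and is not a measurement stated in this section.

```python
# IR and SSC lengths (kb) as reported in the text, rounded to whole kb.
ir_kb = {"Didymosphenia": 7, "Eunotia": 27}
eunotia_ssc_kb = 25
increase_vs_asterionella_kb = 18

# Ratio of the longest to the shortest IR ("nearly 4-fold").
fold_range = ir_kb["Eunotia"] / ir_kb["Didymosphenia"]

# IR length implied for Asterionella by the reported 18 kb increase.
implied_asterionella_ir_kb = ir_kb["Eunotia"] - increase_vs_asterionella_kb

print(f"IR fold range: {fold_range:.2f}")
print(f"Implied Asterionella IR: ~{implied_asterionella_ir_kb} kb")
print("Eunotia IR longer than SSC:", ir_kb["Eunotia"] > eunotia_ssc_kb)
```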

Expansions of Intergenic Regions 

Variation in plastid genome size primarily reflected differences 
in the amount of intergenic DNA, which comprises 12-39% 
(15-65 kb) of the genome in the eight newly sequenced 
genomes (fig. 1). Diatom plastid genomes are generally com- 
pact, with any given intergenic region rarely exceeding 500 bp 
in length. The plastid genomes of six species — T. oceanica, 
Asterionellopsis, Eunotia, Cylindrotheca, Kryptoperidinium, 
and Fistulifera — are, however, larger than average due to 
the presence of numerous expanded intergenic regions of 
>1 kb in length (fig. 1). These regions are spread across a 
dozen or so locations in the genomes and range in length 
from 1 to 10 kb, accounting for anywhere from 3 to 53 kb 
(2-32%) of the overall genome in these six species (fig. 1). 

While a small fraction (<3% in all cases) of these "extra" 
intergenic sequences can be traced to diatom plasmids, the 
majority are of unknown origin. Excluding plasmid-derived 
sequences, roughly 68-99% of the expanded intergenic 
sequences are species-specific, showing no similarity to 
sequenced primary or secondary red (including diatoms) 
plastid or plasmid genomes; roughly a quarter of the large 
Cylindrotheca plastid genome has no matches to GenBank 
sequences of any kind. The expanded intergenic sequences 



have significantly higher GC content (mean = 29%) compared 
with the small, highly AT-rich (mean GC = 19%) intergenic regions 
ancestrally present in diatom plastid genomes, strongly sug- 
gesting that the expanded regions have a different ancestry 
(Lawrence and Ochman 1997; Ragan et al. 2006). Many of 
these regions also contain long, unique ORFs. Considering 
only those ORFs >100 amino acids in length and with canonical 
start and stop codons, we found a total of 64 of them 
across the six species with expanded intergenic regions (fig. 1). 
ORFs ranged from 100 to 439 amino acids in length, and 
notably, just four of them were shared between any two 
genomes. 
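
The ORF criterion used here (a canonical start codon, an in-frame canonical stop codon, and a minimum length in amino acids) can be illustrated with a six-frame scan. This is a minimal sketch and not a reimplementation of NCBI's ORF Finder; treating ATG as the only start codon, reporting the longest ORF per stop codon, and using an inclusive threshold of 100 amino acids are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_aa=100):
    """Return (strand, start, end, length_aa) for each ATG...stop ORF that
    encodes at least min_aa amino acids. Coordinates are 0-based and
    end-exclusive on the scanned strand; `end` includes the stop codon,
    and length_aa excludes it."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG after the last stop
                elif start is not None and codon in STOPS:
                    length_aa = (i - start) // 3
                    if length_aa >= min_aa:
                        orfs.append((strand, start, i + 3, length_aa))
                    start = None                   # resume searching after the stop
    return orfs

# Hypothetical toy sequence: ATG, 120 alanine codons, then a stop codon.
toy = "ATG" + "GCT" * 120 + "TAA"
print(find_orfs(toy))
```

Running the scan over each expanded intergenic region and comparing the resulting ORFs between genomes would reproduce the kind of tally reported above, subject to the stated assumptions.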

Similar types of anonymous stretches of intergenic DNA 
have been found in other primary and secondary red plastid 
genomes, though not to this extent (Cattolico et al. 2008; 
Janouskovec et al. 2013). Additional comparative genomic 
data will help narrow down the timing of these acquisitions 
and, hopefully, show whether they reflect an extreme case 
of differential loss of ancestral sequences or acquisitions of 
foreign DNA. A foreign origin seems much more plausible, 
however, considering that the differential loss model requires 
that the ORFs — found in no other primary or secondary red 
plastid genomes — were present in the ancestral diatom plastid 
genome, maintained or subsequently evolved an aberrantly 
high GC content, and experienced an exceptional pattern of 
repeated loss. 

Gene Acquisitions, Losses, and Functional Transfers to 
the Nucleus 

A total of 15 genes were variably present across the 16 species 
in our analysis (fig. 3). This pattern reflects 1) a dynamic history 
of gene losses and functional transfers to the nucleus across a 
broad range of phylogenetic depths, 2) gene duplications fol- 
lowed by differential losses of paralogs, and 3) acquisitions of 
foreign genes. In some cases, the small number of sequenced 
diatom nuclear genomes limited our ability to distinguish be- 
tween gene losses and functional transfers to the nucleus. 
Likewise, the number of genes dually resident in the plastid 
and nuclear genomes has almost certainly been underesti- 
mated. For example, the psb28 gene is present in the nuclear 
genome of T. pseudonana (Jiroutova et al. 2010), a transfer 
one would not have predicted based on the universal presence 
of psb28 in diatom plastid genomes, including that of T. pseudonana. 
We expect, therefore, that the patterns inferred here 
will be continuously refined in the coming years as more plas- 
tid and nuclear genome sequences become available. 

Widespread and Ongoing Gene Loss 

Although some genes have been lost or transferred to the 
nucleus just once, most of the variably present genes 
showed considerably more complex patterns involving inde- 
pendent losses across a broad range of phylogenetic depths. 
For example, the peroxiredoxin gene, bas1, has been lost 
Fig. 2. Size variation of the inverted repeat region (IRa) across 16 fully sequenced plastid genomes in diatoms. Colored boxes circumscribe genes in 
various functional categories, with those above the line transcribed on the forward strand and those below the line on the reverse strand. Maps are drawn to scale, 
and the gray box demarcates the core rns-trnI-trnA-rnl-rrn5 gene cluster conserved across the 16 genomes. Newly sequenced taxa from this study are in 
boldface, and the nucleotide length of IRa is in parentheses beneath each taxon name. Double arrows delimit large putatively foreign sequence insertions. 



repeatedly from both red algal and chromalveolate plastid 
genomes (Douglas and Penny 1999; Glockner et al. 2000; 
Sanchez-Puerta et al. 2005), and this pattern extends to dia- 
toms as well. Assuming bas1 was present in the ancestral 
diatom plastid genome, the gene has been lost at least six 



separate times in taxa spanning the entire phylogeny 
(fig. 3). Although most of the genomes show no remaining 
trace of bas1, four distantly related taxa have retained 
what appear to be independently ameliorated pseudogene 
fragments, indicating that losses are ongoing in several 
Fig. 3. Evolutionary patterns of pseudogenization, loss, and gain of genes in diatom plastid genomes. The matrix shows 15 genes variably present 
among the sequenced genomes. The presence of nuclear gene copies is almost certainly underreported for lack of nuclear genome data in most species. 
Taxa in boldface identify genomes sequenced for this study. Phylogenetic relationships were redrawn from Theriot et al. (2010) and unpublished data. Taxa 
marked with a superscript "a" are dinoflagellates with diatom-derived plastids (Imanian et al. 2010), and genes marked with a superscript "b" are of plasmid 
(serC) or unknown (tyrC) origin. 



lineages (fig. 3). Additional nuclear genomic data will help 
clarify whether the system of antioxidative protection provided 
by bas1 to plastids (Baier and Dietz 1997) has been lost, 
replaced, or handed over to the nucleus in some diatoms. 
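
The minimum number of losses quoted for bas1 follows a parsimony-style argument: if a gene was present in the common ancestor and is never regained, every maximal clade that lacks it implies one loss. A minimal sketch of that counting rule is shown below; the tree and the set of gene-bearing taxa are hypothetical toy data, not the topology or distribution shown in fig. 3, and the text does not state how the reported minimum was actually computed.

```python
def min_losses(tree, present):
    """Count the minimum number of independent gene losses on a rooted tree,
    assuming the gene was present at the root and is never regained.

    tree    -- nested tuples of taxon names (leaves are strings)
    present -- set of taxa that still carry the gene
    Returns (number_of_losses, clade_contains_gene).
    """
    if isinstance(tree, str):
        has_gene = tree in present
        return (0 if has_gene else 1), has_gene
    losses, any_present = 0, False
    for child in tree:
        child_losses, child_has = min_losses(child, present)
        losses += child_losses
        any_present = any_present or child_has
    if not any_present:
        # The whole clade lacks the gene: one loss on its stem branch suffices.
        return 1, False
    return losses, True

# Hypothetical five-taxon tree and gene distribution.
toy_tree = (("A", "B"), (("C", "D"), "E"))
print(min_losses(toy_tree, {"A", "C"})[0])
```

Pseudogene fragments would be scored as absences under this rule, which is why they are informative about the timing of a loss but do not change the count.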

The tRNA synthetase gene, syfB, has a similar history 
of repeated loss — in Asterionellopsis, deep within the 
Odontella+Thalassiosira clade, and in Coscinodiscus, which 
retains a highly degenerated pseudogene (fig. 3). The syfB 
and syfH genes are the last remaining tRNA synthetase 
genes in primary (B and H) and secondary (B only) red plastids, 
so their mere persistence in diatom plastid genomes is prob- 
ably more noteworthy than the seemingly inevitable losses 
recorded here. The syfB gene typically encodes the β subunit 
of phenylalanyl-tRNA synthetase (PheRS), a heterotetramer 
with α- and β-subunits often encoded by separate genes 
(Safro et al. 2000). Organellar PheRS can function, however, 
as a single chimerically structured monomer with α- and 
β-domains encoded within a single gene (Safro et al. 
2000; Duchene et al. 2009). Thalassiosira pseudonana and 
T. oceanica both lack the plastid syfB gene but have a 
nuclear-encoded PheRS gene with this chimeric structure as 
well as signal and target peptides that predict plastid localization 
of the product (not shown). Thus, tRNA-Phe in diatom 
plastids, at least those lacking a syfB gene, appears to be 
loaded by a monomeric PheRS. The plastid-targeted PheRS 
gene appears to have been ancestrally present in diatoms 



but is missing from Phaeodactylum, which might account 
for the conservation of plastid syfB in araphid and raphid pen- 
nates (fig. 3). 

Several genes showed a pattern of recent, lineage-specific 
loss. For example, losses of the thiamine biosynthesis genes, 
thiG and thiS, were restricted to a single lineage, represented 
here by Fistulifera (fig. 3). Likewise, ycf88 is missing only from 
Leptocylindrus (fig. 3). This conserved hypothetical protein is 
known only from diatom plastid genomes. If ycf88 was pre- 
sent in the ancestral diatom plastid genome, its absence in 
Leptocylindrus represents a lineage-specific loss. Alternatively, 
ycf88 might have originated after the split between 
Leptocylindrus and the rest of the diatoms. 

Functional Transfers to the Nucleus 

The early stages of establishment of an organelle are charac- 
terized by massive gene losses and functional transfers from 
the endosymbiont to the host nuclear genome (Kleine et al. 
2009). Although this process has all but ceased in many or- 
ganelles (e.g., animal mitochondria, Boore 1999), gene losses 
are ongoing in several lineages, including the mitochondrial 
genomes of land plants (Adams and Palmer 2003). Despite 
many potential obstacles (Martin and Herrmann 1998; Gruber 
et al. 2007), intracellular gene transfers from the plastid to the 
nuclear genome are quite common in diatoms (Oudot-Le Secq 
et al. 2007; Lommer et al. 2010; this study). 






A total of five plastid genes have been either functionally 
transferred to the nucleus or maintain dual residency in the 
plastid and nuclear genomes (fig. 3). Two of these transfers, 
involving petF and petJ, were previously known (Kilian and 
Kroth 2004; Lommer et al. 2010). The petF case is a special 
ecologically driven transfer restricted to a single species 
(Lommer et al. 2010), and it is now clear that petJ was transferred 
to the nucleus early on in diatom evolution, sometime 
after the split between Leptocylindrus and all other diatoms 
(fig. 3). 

Two genes involved in amino acid biosynthesis, ilvB and ilvH 
(the large and small subunits of acetolactate synthase) are 
widespread in primary and secondary red plastid genomes, 
absent only from haptophytes and previously sequenced dia- 
toms (Sanchez-Puerta et al. 2005; Wang et al. 2013). The 
highly disjunct distribution of these genes in diatom plastid 
genomes reflects a history of repeated loss, at least four of 
them among our small sample of diatom diversity (fig. 3). The 
nuclear genomes of T. pseudonana and Phaeodactylum con- 
tain plastid-like ilvB genes with signal and target peptides that 
predict plastid localization of the protein, so losses of ilvB from 
the plastid genome likely coincided with functional transfers 
into the nucleus. Unlike ilvB, the apparently single, deep loss of 
the other acetolactate synthase subunit, ilvH, was not accom- 
panied by a functional transfer to the nuclear genome. 

Dual residency of a gene in the organelle and nuclear ge- 
nomes is common in the early stages of intracellular transfer, 
but the transfer generally resolves with loss of the organellar 
or, in some cases, the nuclear copy of the gene (Adams et al. 
1999). The translation factor gene, tsf, appears to represent 
an altogether different phenomenon (fig. 3). The gene is pre- 
sent in the plastid genomes of two distantly related raphid 
pennates (Eunotia and Fistulifera), the nuclear genomes of 
T. pseudonana and T. oceanica, and both the nuclear and 
plastid genomes of Phaeodactylum (Oudot-Le Secq et al. 
2007; Tanaka et al. 2011). Although the nuclear tsf genes 
in both Thalassiosira species have signal and transit peptides 
that predict targeting to the plastid, the nuclear copy in 
Phaeodactylum lacks both of these. While this would seem 
to suggest separate plastid-to-nuclear transfers in these two 
lineages, phylogenetic analysis resolved all nuclear copies into 
a strongly supported clade (fig. 4). The plastid-encoded tsf 
copies are also monophyletic and show levels of sequence 
divergence on par with other chromalveolates (fig. 4). Taken 
together, these results are consistent with a single deep plas- 
tid-to-nuclear transfer event followed by long-term conserva- 
tion of both the plastid and nuclear copies (for tens of millions 
of years), with repeated losses of the plastid copy (fig. 3) — at 
least ten of them when mapped onto a representative sample 
of diatom diversity (Theriot et al. 2010). Long-term mainte- 
nance of the plastid and nuclear copies may have led to func- 
tional differentiation of the nuclear copy in Phaeodactylum, 
which is highly divergent (fig. 4) and apparently no longer 
targeted to the plastid. Experimental data are necessary to 



determine the exact localization of the nuclear-encoded prod- 
uct and show whether it has, in fact, assumed a new or mod- 
ified function in Phaeodactylum. 

Gene Duplication 

Although gene duplication and divergence provide an impor- 
tant source of new genetic variation in nuclear genomes 
(Conant and Wolfe 2008) and a smattering of animal mito- 
chondrial genomes (Milani et al. 2013), divergent gene dupli- 
cates are rare in plastid genomes. Most duplicated plastid 
genes maintain their sequence identity through active recom- 
bination and gene conversion involving either duplicate copies 
of the genome within a cell or the recombinationally active IR 
(Chumley et al. 2006). Thus, gene duplicates in the plastid 
tend either to remain identical in sequence (Wakasugi et al. 
1994; Haberle et al. 2008; Guisinger et al. 2011) or suffer 
deterioration and loss of one copy (Poczai and Hyvonen 2013). 

Within this context then, the presence of two highly diver- 
gent copies of the fatty acid biosynthesis gene, acpP, in several 
plastid genomes is exceptional, reflecting a history that is more 
characteristic of a dynamically evolving nuclear gene family 
than a typical organelle gene. Although relationships within 
the acpP gene tree were generally unsupported, phylogenetic 
analysis recovered a strongly supported acpP2 clade, to the 
exclusion of counterpart acpP1 duplicates in Lithodesmium, 
Asterionella, and Eunotia (fig. 4) — a result that points to a 
relatively ancient (tandem) duplication followed by at least 
seven separate losses of one or both paralogs in the descen- 
dant lineages (fig. 3). These losses left some plastid genomes 
with both copies and others with one or, in some cases, none 
(fig. 3). Despite highly divergent amino acid sequences 
between plastid acpP paralogs (28-35% amino acid identity), 
differential retention of just the acpP2 copy in Asterionellopsis 
suggests that the two genes are functionally equivalent, a 
hypothesis that would be further supported if acpP1 is not 
found in the Asterionellopsis nuclear genome. The plastid 
acpP gene was, in fact, also duplicated into the nucleus, pos- 
sibly around the time of the plastid gene duplication (fig. 3). 
A few species have retained only the nuclear copy of the gene 
(fig. 3), which has signal and target peptides consistent with 
plastid localization of the product. 
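
The divergence between paralogs is summarized above as percent amino acid identity. A minimal sketch of that calculation for one aligned pair follows; the two sequences are hypothetical, and excluding gapped columns from the denominator is an assumption, since the text does not say how gaps were treated.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two aligned sequences, computed over the
    alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue                 # skip gapped columns
        compared += 1
        matches += (x == y)
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical aligned peptide fragments, for illustration only.
print(percent_identity("MKV-LA", "MRVTLA"))
```

Other conventions (for example, dividing by the full alignment length) give lower values for the same pair, so the denominator should be stated whenever such figures are compared across studies.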

Genes of Foreign or Uncertain Origin 

Although foreign sequence acquisitions by plastid genomes 
are rare, horizontal transfer has introduced novel genes and 
introns into a few algal plastid genomes, including the diatom, 
Seminavis (Brouard et al. 2008; Khan and Archibald 2008; 
Brembu et al. 2013). Some of these foreign sequences were 
acquired from plasmids (Imanian et al. 2010; Brembu et al. 
2013; Wang et al. 2013), whose cellular localization in dia- 
toms includes both the nucleus and plastid (Hildebrand et al. 
1992). Most notably, plasmids have introduced intact and 
putatively functional site-specific recombinase genes into the 
plastid genomes of several diatoms (fig. 3). Recombinases 
enzymatically break and rejoin DNA and fall into two unre- 
lated families based on their DNA break-religate mechanism 
and the amino acid (serine or tyrosine) that mediates DNA 
cleavage (Grindley et al. 2006). They are essential for bacterial 
genome replication and differentiation (Nash 1996) and play 
important roles in the movement of transposons, plasmids, 
and bacteriophages within and between bacterial genomes 
(Smith and Thorpe 2002), making them highly plausible can- 
didates for horizontal transfer. 

The plastid genomes of five species contain one or two 
plasmid-derived serine recombinase (serC) genes or pseudogenes 
(fig. 3). Although a previous survey did not find plasmids 
outside of the raphid pennate lineage (Hildebrand et al. 
1991), the discovery of serC in Asterionellopsis predicts that 
araphid pennates contain plasmids as well. In addition to a 
serC pseudogene, the Cylindrotheca plastid genome also 
contains a short fragment with similarity to a newly discov- 
ered plasmid from our assembly (supplementary fig. S1, 
Supplementary Material online). Asterionellopsis, Eunotia, 
and Cylindrotheca also contain sequences matching noncod- 
ing sequences and ORFs from known diatom plasmids 
(Hildebrand et al. 1991, 1992). 

Although tyrC shares similar recombinase functions with 
serC, the origins of tyrC in select diatom (Imanian et al. 
2010), raphidophyte (Cattolico et al. 2008), and green algal 
(Brouard et al. 2008) plastid genomes are less clear. Like serC, 
however, tyrC appears to be restricted to the pennate diatom 



lineage (the Asterionellopsis+Didymosphenia clade in fig. 3). 
Moreover, in both Asterionellopsis and Heterosigma (another 
heterokont), tyrC is adjacent to ORFs with low similarity to 
known plasmid ORFs, pointing to a probable plasmid origin 
for tyrC in diatom plastid genomes. Still, tyrC is common in 
bacterial genomes and plasmids (Leplae et al. 2006; Van 
Houdt et al. 2012), and in light of the close associations be- 
tween diatoms and bacteria (Bowler et al. 2008; Amin et al. 
2012), a direct bacterial origin of tyrC cannot be ruled out. 
Indeed, bacterial HGT has introduced novel foreign genes into 
both primary (Janouskovec et al. 2013) and secondary (Khan 
et al. 2007) red algal plastid genomes. 

Genome Rearrangements 

Aside from sharing the common quadripartite plastid genome 
architecture, diatom plastid genomes are otherwise highly 
rearranged (Oudot-Le Secq et al. 2007) — a finding under- 
scored by the eight newly sequenced genomes. Illustrative 
of this, the plastid genomes of three representative diatoms 
had to be subdivided into 32 colinear gene blocks to create a 
whole-genome alignment (fig. 5). Some lineages have expe- 
rienced a higher frequency of rearrangements than others. For 
example, the genomes of two Thalassiosirales, T. pseudonana 
and T. oceanica, are highly rearranged relative to one another, 
whereas the genomes of two raphid pennates, Didymosphenia 
and Phaeodactylum, are perfectly collinear (not shown). 
Because the single-copy regions of the genome are so 
highly rearranged, shifts in the IR boundaries result in the 
annexation or loss of different sets of genes in different 
diatom lineages (fig. 2). Dense, focused sampling within par- 
ticular lineages will show whether rearrangements are associ- 
ated with rare recombination events across tRNAs or other 
small repetitive sequences (Turmel et al. 2002; Weng et al. 
2013). 
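
The idea of subdividing genomes into collinear blocks can be illustrated at the level of gene order. The sketch below counts maximal runs of genes that are adjacent, in the same or the reversed order, in a reference genome; it is a much-simplified stand-in for the locally collinear blocks computed by progressiveMauve, it treats genomes as linear and ignores strand, and the gene names are hypothetical.

```python
def count_collinear_blocks(reference, query):
    """Count maximal runs of genes in `query` that occur consecutively, in
    the same or fully reversed order, in `reference`. Both arguments are
    lists containing the same unique gene names."""
    position = {gene: i for i, gene in enumerate(reference)}
    idx = [position[gene] for gene in query]
    if not idx:
        return 0
    blocks, direction = 1, 0
    for prev, cur in zip(idx, idx[1:]):
        step = cur - prev
        if step in (1, -1) and direction in (0, step):
            direction = step      # neighbor in the reference: extend the block
        else:
            blocks += 1           # breakpoint: start a new block
            direction = 0
    return blocks

# Hypothetical gene orders: the query carries one inverted segment (g5-g4-g3).
reference_order = ["g1", "g2", "g3", "g4", "g5", "g6"]
query_order = ["g1", "g2", "g5", "g4", "g3", "g6"]
print(count_collinear_blocks(reference_order, query_order))
```

A pair of perfectly collinear genomes yields a single block under this measure, whereas each additional inversion or translocation increases the count.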

Conclusions 

Despite their ecological importance and substantial contribu- 
tion to global primary production, surprisingly little is known 
about the plastid genomes of diatoms. Our goal was to help 
fill this gap by doubling the number of fully sequenced plastid 
genomes and greatly expanding the phylogenetic breadth of 
sampled species. Our increased taxon sampling revealed levels 
of variation in plastid gene content, genome size, and genome 
architecture exceeding those in many other plastid-bearing 
lineages. Angiosperms, for example, are similar to diatoms 
in both taxonomic diversity and geologic age, but with just 
a few noteworthy exceptions (Cai et al. 2008; Sloan et al. 
2012), their plastid genomes are characterized by long-term 
evolutionary stasis (Jansen and Ruhlman 2012). Diatom plastid 
genomes, by contrast, exhibit complex patterns of gene gains 
and losses and, more compelling still, a propensity to acquire 
and retain foreign DNA. 

In many cases, our inferences, especially with respect to 
gene gains and losses, hinged heavily on our taxonomic sam- 
pling. For example, the Eunotia plastid genome is a hoarder, 
holding onto genes that have been tossed out in most other 
species. This single genome highlighted patterns of loss far 
more complex than would have been evident if it had not 



been sequenced. In light of this, and given that diversity esti- 
mates for diatoms number into the hundreds of thousands of 
species, we expect that diatom plastid genomes hold many 
more surprises, that the full pan-genome is still unknown for 
diatom plastids, and that the inferences made here will be 
substantially modified in the coming years. Finally, important 
phylogenetic relationships within diatoms remain unresolved 
or poorly supported (Theriot et al. 2010), severely constraining 
current and future comparative genomic studies. Efforts to 
better characterize the phylogenetic relationships of diatoms 
will pay great dividends to these and other emergent fields of 
research on this diverse and ecologically important lineage. 

Supplementary Material 

Supplementary tables S1 and S2 and figure S1 are available at 
Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/). 